{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "# Entity Service Similarity Scores Output\n", "\n", "This tutorial demonstrates generating CLKs from PII, creating a new project on the entity service, and how to retrieve the results. \n", "The output type is raw similarity scores. This output type is particularly useful for determining a good threshold for the greedy solver used in mapping.\n", "\n", "The sections are usually run by different participants - but for illustration all is carried out in this one file. The participants providing data are *Alice* and *Bob*, and the analyst is acting as the integration authority.\n", "\n", "### Who learns what?\n", "\n", "Alice and Bob will both generate and upload their CLKs.\n", "\n", "The analyst - who creates the linkage project - learns the `similarity scores`. Be aware that this is a lot of information and are subject to frequency attacks.\n", "\n", "### Steps\n", "\n", "* Check connection to Entity Service\n", "* Data preparation\n", " * Write CSV files with PII\n", " * Create a Linkage Schema\n", "* Create Linkage Project\n", "* Generate CLKs from PII\n", "* Upload the PII\n", "* Create a run\n", "* Retrieve and analyse results" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import json\n", "import os\n", "import time\n", "\n", "import matplotlib.pyplot as plt\n", "import requests\n", "import clkhash.rest_client\n", "from IPython.display import clear_output" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Check Connection\n", "\n", "If you are connecting to a custom entity service, change the address here." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Testing anonlink-entity-service hosted at https://testing.es.data61.xyz\n" ] } ], "source": [ "url = os.getenv(\"SERVER\", \"https://testing.es.data61.xyz\")\n", "print(f'Testing anonlink-entity-service hosted at {url}')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"project_count\": 2115, \"rate\": 7737583, \"status\": \"ok\"}\r\n" ] } ], "source": [ "!clkutil status --server \"{url}\"" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Data preparation\n", "\n", "Following the [clkhash tutorial](http://clkhash.readthedocs.io/en/latest/tutorial_cli.html) we will use a dataset from the `recordlinkage` library. We will just write both datasets out to temporary CSV files.\n", "\n", "If you are following along yourself you may have to adjust the file names in all the `!clkutil` commands." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "from tempfile import NamedTemporaryFile\n", "from recordlinkage.datasets import load_febrl4" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
given_namesurnamestreet_numberaddress_1address_2suburbpostcodestatedate_of_birthsoc_sec_id
rec_id
rec-1070-orgmichaelaneumann8stanley streetmiamiwinston hills4223nsw191511115304218
rec-1016-orgcourtneypainter12pinkerton circuitbega flatsrichlands4560vic191612144066625
rec-4405-orgcharlesgreen38salkauskas crescentkeladapto4566nsw194809304365168
\n", "
" ], "text/plain": [ " given_name surname street_number address_1 \\\n", "rec_id \n", "rec-1070-org michaela neumann 8 stanley street \n", "rec-1016-org courtney painter 12 pinkerton circuit \n", "rec-4405-org charles green 38 salkauskas crescent \n", "\n", " address_2 suburb postcode state date_of_birth \\\n", "rec_id \n", "rec-1070-org miami winston hills 4223 nsw 19151111 \n", "rec-1016-org bega flats richlands 4560 vic 19161214 \n", "rec-4405-org kela dapto 4566 nsw 19480930 \n", "\n", " soc_sec_id \n", "rec_id \n", "rec-1070-org 5304218 \n", "rec-1016-org 4066625 \n", "rec-4405-org 4365168 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfA, dfB = load_febrl4()\n", "\n", "a_csv = NamedTemporaryFile('w')\n", "a_clks = NamedTemporaryFile('w', suffix='.json')\n", "dfA.to_csv(a_csv)\n", "a_csv.seek(0)\n", "\n", "b_csv = NamedTemporaryFile('w')\n", "b_clks = NamedTemporaryFile('w', suffix='.json')\n", "dfB.to_csv(b_csv)\n", "b_csv.seek(0)\n", "\n", "dfA.head(3)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Schema Preparation\n", "\n", "The linkage schema must be agreed on by the two parties. A hashing schema instructs clkhash how to treat each column for generating CLKs. A detailed description of the hashing schema can be found in the api docs. We will ignore the columns ‘rec_id’ and ‘soc_sec_id’ for CLK generation." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "schema = NamedTemporaryFile('wt')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting /tmp/tmpvlivqdcf\n" ] } ], "source": [ "%%writefile {schema.name}\n", "{\n", " \"version\": 1,\n", " \"clkConfig\": {\n", " \"l\": 1024,\n", " \"k\": 30,\n", " \"hash\": {\n", " \"type\": \"doubleHash\"\n", " },\n", " \"kdf\": {\n", " \"type\": \"HKDF\",\n", " \"hash\": \"SHA256\",\n", " \"info\": \"c2NoZW1hX2V4YW1wbGU=\",\n", " \"salt\": \"SCbL2zHNnmsckfzchsNkZY9XoHk96P/G5nUBrM7ybymlEFsMV6PAeDZCNp3rfNUPCtLDMOGQHG4pCQpfhiHCyA==\",\n", " \"keySize\": 64\n", " }\n", " },\n", " \"features\": [\n", " {\n", " \"identifier\": \"rec_id\",\n", " \"ignored\": true\n", " },\n", " {\n", " \"identifier\": \"given_name\",\n", " \"format\": { \"type\": \"string\", \"encoding\": \"utf-8\" },\n", " \"hashing\": { \"ngram\": 2, \"weight\": 1 }\n", " },\n", " {\n", " \"identifier\": \"surname\",\n", " \"format\": { \"type\": \"string\", \"encoding\": \"utf-8\" },\n", " \"hashing\": { \"ngram\": 2, \"weight\": 1 }\n", " },\n", " {\n", " \"identifier\": \"street_number\",\n", " \"format\": { \"type\": \"integer\" },\n", " \"hashing\": { \"ngram\": 1, \"positional\": true, \"weight\": 1, \"missingValue\": {\"sentinel\": \"\"} }\n", " },\n", " {\n", " \"identifier\": \"address_1\",\n", " \"format\": { \"type\": \"string\", \"encoding\": \"utf-8\" },\n", " \"hashing\": { \"ngram\": 2, \"weight\": 1 }\n", " },\n", " {\n", " \"identifier\": \"address_2\",\n", " \"format\": { \"type\": \"string\", \"encoding\": \"utf-8\" },\n", " \"hashing\": { \"ngram\": 2, \"weight\": 1 }\n", " },\n", " {\n", " \"identifier\": \"suburb\",\n", " \"format\": { \"type\": \"string\", \"encoding\": \"utf-8\" },\n", " \"hashing\": { \"ngram\": 2, \"weight\": 1 }\n", " },\n", " {\n", " \"identifier\": \"postcode\",\n", " \"format\": { \"type\": \"integer\", \"minimum\": 100, \"maximum\": 9999 },\n", " \"hashing\": { \"ngram\": 1, \"positional\": true, \"weight\": 1 }\n", " },\n", " {\n", " \"identifier\": \"state\",\n", " \"format\": { \"type\": \"string\", \"encoding\": \"utf-8\", \"maxLength\": 3 },\n", " \"hashing\": { \"ngram\": 2, \"weight\": 1 }\n", " },\n", " {\n", " \"identifier\": \"date_of_birth\",\n", " \"format\": { \"type\": \"integer\" },\n", " \"hashing\": { \"ngram\": 1, \"positional\": true, \"weight\": 1, \"missingValue\": {\"sentinel\": \"\"} }\n", " },\n", " {\n", " \"identifier\": \"soc_sec_id\",\n", " \"ignored\": true\n", " }\n", " ]\n", "}" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Create Linkage Project\n", "\n", "The analyst carrying out the linkage starts by creating a linkage project of the desired output type with the Entity Service.\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Credentials will be saved in /tmp/tmpcwpvq6kj\n", "\u001b[31mProject created\u001b[0m\n" ] }, { "data": { "text/plain": [ "{'project_id': '1eb3da44f73440c496ab42217381181de55e9dcd6743580c',\n", " 'result_token': '846c6c25097c7794131de0d3e2c39c04b7de9688acedc383',\n", " 'update_tokens': ['52aae3f1dfa8a4ec1486d8f7d63a8fe708876b39a8ec585b',\n", " '92e2c9c1ce52a2c2493b5e22953600735a07553f7d00a704']}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "creds = NamedTemporaryFile('wt')\n", "print(\"Credentials will be saved in\", creds.name)\n", "\n", "!clkutil create-project --schema \"{schema.name}\" --output \"{creds.name}\" --type \"similarity_scores\" --server \"{url}\"\n", "creds.seek(0)\n", "\n", "with open(creds.name, 'r') as f:\n", " credentials = json.load(f)\n", "\n", "project_id = credentials['project_id']\n", "credentials" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "**Note:** the analyst will need to pass on the `project_id` (the id of the linkage project) and one of the two `update_tokens` to each data provider.\n", "\n", "## Hash and Upload\n", "\n", "At the moment both data providers have *raw* personally identiy information. We first have to generate CLKs from the raw entity information. Please see [clkhash](https://clkhash.readthedocs.io/) documentation for further details on this." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 1.06kclk/s, mean=883, std=33.6]\n", "\u001b[31mCLK data written to /tmp/tmpj8m1dvxj.json\u001b[0m\n", "generating CLKs: 100%|█| 5.00k/5.00k [00:01<00:00, 1.30kclk/s, mean=875, std=39.7]\n", "\u001b[31mCLK data written to /tmp/tmpi2y_ogl9.json\u001b[0m\n" ] } ], "source": [ "!clkutil hash \"{a_csv.name}\" horse staple \"{schema.name}\" \"{a_clks.name}\"\n", "!clkutil hash \"{b_csv.name}\" horse staple \"{schema.name}\" \"{b_clks.name}\"" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "Now the two clients can upload their data providing the appropriate *upload tokens*." ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "### Alice uploads her data" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "with NamedTemporaryFile('wt') as f:\n", " !clkutil upload \\\n", " --project=\"{project_id}\" \\\n", " --apikey=\"{credentials['update_tokens'][0]}\" \\\n", " --server \"{url}\" \\\n", " --output \"{f.name}\" \\\n", " \"{a_clks.name}\"\n", " res = json.load(open(f.name))\n", " alice_receipt_token = res['receipt_token']" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "Every upload gets a receipt token. In some operating modes this receipt is required to access the results." ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "### Bob uploads his data" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "with NamedTemporaryFile('wt') as f:\n", " !clkutil upload \\\n", " --project=\"{project_id}\" \\\n", " --apikey=\"{credentials['update_tokens'][1]}\" \\\n", " --server \"{url}\" \\\n", " --output \"{f.name}\" \\\n", " \"{b_clks.name}\"\n", " \n", " bob_receipt_token = json.load(open(f.name))['receipt_token']" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Create a run\n", "\n", "Now the project has been created and the CLK data has been uploaded we can carry out some privacy preserving record linkage. Try with a few different threshold values:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "with NamedTemporaryFile('wt') as f:\n", " !clkutil create \\\n", " --project=\"{project_id}\" \\\n", " --apikey=\"{credentials['result_token']}\" \\\n", " --server \"{url}\" \\\n", " --threshold 0.9 \\\n", " --output \"{f.name}\"\n", " \n", " run_id = json.load(open(f.name))['run_id']" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "## Results\n", "\n", "Now after some delay (depending on the size) we can fetch the mask.\n", "This can be done with clkutil:\n", "\n", " !clkutil results --server \"{url}\" \\\n", " --project=\"{credentials['project_id']}\" \\\n", " --apikey=\"{credentials['result_token']}\" --output results.txt\n", " \n", "However for this tutorial we are going to use the `clkhash` library:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "State: completed\n", "Stage (2/2): compute similarity scores\n", "Progress: 1.000%\n" ] } ], "source": [ "for update in clkhash.rest_client.watch_run_status(url, project_id, run_id, credentials['result_token'], timeout=300):\n", " clear_output(wait=True)\n", " print(clkhash.rest_client.format_run_status(update))\n", "time.sleep(3)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "data = json.loads(clkhash.rest_client.run_get_result_text(\n", " url, \n", " project_id, \n", " run_id, \n", " credentials['result_token']))['similarity_scores']" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "This result is a large list of tuples recording the similarity between all rows above the given threshold." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[76, 2345, 1.0]\n", "[83, 3439, 1.0]\n", "[103, 863, 1.0]\n", "[154, 2391, 1.0]\n", "[177, 4247, 1.0]\n", "[192, 1176, 1.0]\n", "[270, 4516, 1.0]\n", "[312, 1253, 1.0]\n", "[407, 3743, 1.0]\n", "[670, 3550, 1.0]\n" ] } ], "source": [ "for row in data[:10]:\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "Note there can be a lot of similarity scores:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/plain": [ "1572906" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(data)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "We will display a *sample* of these similarity scores in a histogram using matplotlib:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAD8CAYAAAB+UHOxAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAElFJREFUeJzt3W+QnWd53/HvD9kmbSG1HG89RhJZN4hpxYsIujWmKY0Lgy3saQVtSkynQbieKpnYM2EmeSGSF05JPeO0BQYmxFMnVjFMwHESUjSxUqM4MDQdjC2DMZZVx4sRYynCViJD8DClkXP1xbkFJ2JXe3b37Dla39/PzJl9zvX8OfelI52fnj/n2VQVkqT+vGjaA5AkTYcBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjplAEhSpwwASerUedMewNlcfPHFNTs7O+1hSNK68tBDD/15Vc0stdw5HQCzs7McPHhw2sOQpHUlyddGWc5DQJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTSwZAkh9I8kCSLyU5lOQ/tvplST6fZD7Jbye5oNVf3J7Pt/mzQ9t6d6s/nuTqtWpKkrS0UfYAvgO8oap+FNgO7EhyBfCrwPur6hXAs8ANbfkbgGdb/f1tOZJsA64DXgXsAH49yYZxNiNJGt2SAVADz7Wn57dHAW8AfrfV7wTe0qZ3tue0+W9Mkla/q6q+U1VfBeaBy8fShSRp2Ub6JnD7n/pDwCuADwFfAb5RVafaIkeBTW16E/AUQFWdSvJN4Ida/f6hzQ6vsyZm99yzYP3Irdeu5ctK0row0kngqnq+qrYDmxn8r/0frNWAkuxOcjDJwRMnTqzVy0hS95Z1FVBVfQP4NPA64MIkp/cgNgPH2vQxYAtAm/93gb8Yri+wzvBr3F5Vc1U1NzOz5L2MJEkrNMpVQDNJLmzTfwt4E3CYQRD8RFtsF/DJNr2vPafN/+Oqqla/rl0ldBmwFXhgXI1IkpZnlHMAlwJ3tvMALwLurqo/SPIYcFeS/wR8EbijLX8H8NEk88BJBlf+UFWHktwNPAacAm6squfH244kaVRLBkBVPQK8eoH6kyxwFU9V/V/g3yyyrVuAW5Y/TEnSuPlNYEnqlAEgSZ0yACSpUwaAJHXKAJCkThkAktQpA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTBoAkdcoAkKROGQCS1CkDQJI6ZQBIUqcMAEnqlAEgSZ1aMgCSbEny6SSPJTmU5Oda/ZeTHEvycHtcM7TOu5PMJ3k8ydVD9R2tNp9kz9q0JEkaxXkjLHMK+Pmq+kKSlwIPJTnQ5r2/qv7r8MJJtgHXAa8CXgb8UZJXttkfAt4EHAUeTLKvqh4bRyPLMbvnngXrR269dsIjkaTpWTIAquo4cLxNfyvJYWDTWVbZCdxVVd8BvppkHri8zZuvqicBktzVlp14AEiSlnkOIMks8Grg8610U5JHkuxNsrHVNgFPDa12tNUWq5/5GruTHExy8MSJE8sZniRpGUYOgCQvAX4PeFdV/SVwG/AjwHYGewjvHceAqur2qpqrqrmZmZlxbFKStIBRzgGQ5HwGH/6/VVWfAKiqp4fm/wbwB+3pMWDL0OqbW42z1CVJEzbKVUAB7gAOV9X7huqXDi32VuDRNr0PuC7Ji5NcBmwFHgAeBLYmuSzJBQxOFO8bTxuSpOUaZQ/gx4CfAr6c5OFW+0Xg7Um2AwUcAX4aoKoOJbmbwcndU8CNVfU8QJKbgHuBDcDeqjo0xl4kScswylVAfwJkgVn7z7LOLcAtC9T3n209SdLk+E1gSeqUASBJnTIAJKlTBoAkdcoAkKROGQCS1CkDQJI6ZQBIUqcMAEnqlAEgSZ0yACSpUwaAJHXKAJCkThkAktQpA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUqfOmPYBzyeyeexasH7n12gmPRJLWnnsAktQpA0CSOrVkACTZkuTTSR5LcijJz7X6RUkOJHmi/dzY6knywSTzSR5J8pqhbe1qyz+RZNfatSVJWsooewCngJ+vqm3AFcCNSbYBe4D7qmorcF97DvBmYGt77AZug0FgADcDrwUuB24+HRqSpMlbMgCq6nhVfaFNfws4DGwCdgJ3tsXuBN7SpncCH6mB+4ELk1wKXA0cqKqTVfUscADYMdZuJEkjW9Y5gCSzwKuBzwOXVNXxNuvrwCVtehPw1NBqR1ttsbokaQpGDoAkLwF+D3hXVf3l8LyqKqDGMaAku5McTHLwxIkT49ikJGkBIwVAkvMZfPj/VlV9opWfbod2aD+fafVjwJah1Te32mL1v6Gqbq+quaqam5mZWU4vkqRlGOUqoAB3AIer6n1Ds/YBp6/k2QV8cqj+jnY10BXAN9uhonuBq5JsbCd/r2o1SdIUjPJN4B8Dfgr4cpKHW+0XgVuBu5PcAHwNeFubtx+4BpgHvg1cD1BVJ5P8CvBgW+49VXVyLF1IkpZtyQCoqj8BssjsNy6wfAE3LrKtvcDe5QxQkrQ2/CawJHXKAJCkThkAktQpA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjplAEhSp0a5HXT3Zvfcs2D9yK3XTngkkjQ+7gFIUqcMAEnqlAEgSZ0yACSpUwaAJHXKAJCkThkAktQpA0CSOmUASFKnDABJ6pQBIEmdWjIAkuxN8kySR4dqv5zkWJKH2+OaoXnvTjKf5PEkVw/Vd7TafJI9429FkrQco+wBfBjYsUD9/VW1vT32AyTZBlwHvKqt8+tJNiTZAHwIeDOwDXh7W1aSNCVL3g20qj6bZHbE7e0E7qqq7wBfTTIPXN7mzVfVkwBJ7mrLPrbsEUuSxmI15wBuSvJIO0S0sdU2AU8NLXO01RarS5KmZKUBcBvwI8B24Djw3nENKMnuJAeTHDxx4sS4NitJOsOKAqCqnq6q56vqr4Hf4HuHeY4BW4YW3dxqi9UX2vbtVTVXVXMzMzMrGZ4kaQQr+o1gSS6tquPt6VuB01cI7QM+luR9wMuArcADQICtSS5j8MF/HfBvVzPwc4G/KUzSerZkACT5OHAlcHGSo8DNwJVJtgMFHAF+GqCqDiW5m8HJ3VPAjVX1fNvOTcC9wAZgb1UdGns3kqSRjXIV0NsXKN9xluVvAW5ZoL4f2L+s0UmS1ozfBJakThkAktQpA0CSOmUASFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjplAEhSpwwASeqUASBJnTIAJKlTBoAkdcoAkKROGQCS1CkDQJI6ZQBIUqcMAEnq1HnTHsAL0eyeexasH7n12gmPRJIW5x6AJHXKAJCkThkAktQpA0CSOrVkACTZm+SZJI8O1S5KciDJE+3nxlZPkg8mmU/ySJLXDK2zqy3/RJJda9OOJGlUo+wBfBjYcUZtD3BfVW0F7mvPAd4MbG2P3cBtMAgM4GbgtcDlwM2nQ0OSNB1LBkBVfRY4eUZ5J3Bnm74TeMtQ/SM1cD9wYZJLgauBA1V1sqqeBQ7w/aEiSZqglZ4DuKSqjrfprwOXtOlNwFNDyx1ttcXq3yfJ7iQHkxw8ceLECocnSVrKqk8CV1UBNYaxnN7e7VU1V1VzMzMz49qsJOkMKw2Ap9uhHdrPZ1r9GLBlaLnNrbZYXZI0JSsNgH3A6St5dgGfHKq/o10NdAXwzXao6F7gqiQb28nfq1pNkjQlS94LKMnHgSuBi5McZXA1z63A3UluAL4GvK0tvh+4BpgHvg1cD1BVJ5P8CvBgW+49VXXmiWVJ0gRlcAj/3DQ3N1cHDx5c8fqL3ZTtXONN4iSNU5KHqmpuqeX8JrAkdcoAkKROGQCS1CkDQJI6ZQBIUqcMAEnqlAEgSZ0yACSpUwaAJHXKAJCkThkAktQpA0CSOmUASFKnlrwdtNbeYnct9S6hktaSewCS1CkDQJI6ZQBIUqcMAEnqlAEgSZ0yACSpUwaAJHXKAJCkThkAktQpA0CSOmUASFKnVnUvoCRHgG8BzwOnqmouyUXAbwOzwBHgbVX1bJIAHwCuAb4NvLOqvrCa13+h8x5BktbSOPYA/nlVba+qufZ8D3BfVW0F7mvPAd4MbG2P3cBtY3htSdIKrcUhoJ3AnW36TuAtQ/WP1MD9wIVJLl2D15ckjWC1AVDAp5I8lGR3q11SVcfb9NeBS9r0JuCpoXWPtpokaQpW+/sA/mlVHUvy94ADSf7P8MyqqiS1nA22INkN8PKXv3yVw5MkLWZVewBVdaz9fAb4feBy4OnTh3baz2fa4seALUOrb261M7d5e1XNVdXczMzMaoYnSTqLFQdAkr+T5KWnp4GrgEeBfcCuttgu4JNteh/wjgxcAXxz6FCRJGnCVnMI6BLg9wdXd3Ie8LGq+p9JHgTuTnID8DXgbW35/QwuAZ1ncBno9at4bUnSKq04AKrqSeBHF6j/BfDGBeoF3LjS15MkjZe/FH4dWuwLYuCXxCSNzltBSFKnDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE55GegLjL9DQNKo3AOQpE4ZAJLUKQNAkjplAEhSpzwJ3AlPDks6k3sAktQpA0CSOmUASFKnDABJ6pQBIEmd8iqgznl1kNQv9wAkqVPuAWhB7hlIL3zuAUhSpwwASeqUh4C0LB4akl443AOQpE5NfA8gyQ7gA8AG4Der6tZJj0Hj556BtP5MNACSbAA+BLwJOAo8mGRfVT02yXFocgwG6dw16T2Ay4H5qnoSIMldwE7AAOiMwSBN36QDYBPw1NDzo8BrJzwGncMWC4aVMEykszvnrgJKshvY3Z4+l+TxVWzuYuDPVz+qdaW3nhftN7864ZFMTm/vMdjzcv3wKAtNOgCOAVuGnm9ute+qqtuB28fxYkkOVtXcOLa1XvTWc2/9gj33YhI9T/oy0AeBrUkuS3IBcB2wb8JjkCQx4T2AqjqV5CbgXgaXge6tqkOTHIMkaWDi5wCqaj+wf0IvN5ZDSetMbz331i/Ycy/WvOdU1Vq/hiTpHOStICSpU+syAJLsSPJ4kvkkexaY/8NJ7kvySJLPJNk8NG9XkifaY9dkR75yK+05yfYkn0tyqM37ycmPfmVW8z63+T+Y5GiSX5vcqFdnlX+3X57kU0kOJ3ksyewkx75Sq+z5P7e/24eTfDBJJjv65UuyN8kzSR5dZH5aL/Ot59cMzRvv51dVrasHg5PHXwH+PnAB8CVg2xnL/A6wq02/Afhom74IeLL93NimN067pzXu+ZXA1jb9MuA4cOG0e1rLnofmfwD4GPBr0+5nEj0DnwHe1KZfAvztafe0lj0D/wT4320bG4DPAVdOu6cRev5nwGuARxeZfw3wh0CAK4DPt/rYP7/W4x7Ad28nUVX/Dzh9O4lh24A/btOfHpp/NXCgqk5W1bPAAWDHBMa8Wivuuar+tKqeaNN/BjwDzExk1KuzmveZJP8IuAT41ATGOi4r7jnJNuC8qjoAUFXPVdW3JzPsVVnN+1zADzAIjhcD5wNPr/mIV6mqPgucPMsiO4GP1MD9wIVJLmUNPr/WYwAsdDuJTWcs8yXgX7XptwIvTfJDI657LlpNz9+V5HIG/1i+skbjHKcV95zkRcB7gV9Y81GO12re51cC30jyiSRfTPJf2s0Xz3Ur7rmqPscgEI63x71VdXiNxzsJi/2ZjP3zaz0GwCh+AfjxJF8EfpzBt42fn+6Q1txZe27/g/gocH1V/fV0hjh2i/X8s8D+qjo6zcGtkcV6Pg94fZv/jxkcUnnnlMY4bgv2nOQVwD9kcEeBTcAbkrx+esNcf865ewGNYJTbSfwZ7X8MSV4C/Ouq+kaSY8CVZ6z7mbUc7JisuOf2/AeBe4BfaruU68Fq3ufXAa9P8rMMjoVfkOS5qvq+E4znmNX0fBR4uL53p93/weD48R2TGPgqrKbn/wDcX1XPtXl/CLwO+F+TGPgaWuzPZPyfX9M+IbKCEyjnMTj5cRnfO2n0qjOWuRh4UZu+BXjP0EmUrzI4gbKxTV807Z7WuOcLgPuAd027j0n1fMYy72T9nARezfu8oS0/057/d+DGafe0xj3/JPBHbRvnt7/n/2LaPY3Y9yyLnwS+lr95EviBVh/759fU/yBW+Id3DfCnDI5l/1KrvQf4l236J4An2jK/Cbx4aN1/D8y3x/XT7mWtewb+HfBXwMNDj+3T7met3+ehbaybAFhtzwx+0dIjwJeBDwMXTLufteyZQej9N+Awg98p8r5p9zJivx9ncM7irxgcx78B+BngZ9r8MPjFWV9p7+Xc0Lpj/fzym8CS1KkX6klgSdISDABJ6pQBIEmdMgAkqVMGgCR1ygCQpE4ZAJLUKQNAkjr1/wHNa9U2GtFvqQAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.hist([_[2] for _ in data[::100]], bins=50);" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "The vast majority of these similarity scores are for non matches. Let's zoom into the right side of the distribution." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAEIpJREFUeJzt3XuMpXV9x/H3h10u9cptS8guOLTStPQi0i3FWqtAbLlYl7aI2KYudNONERMba+q2/aOpqQm0qaixMd2IdTH1Qq0WolihC8ReBF3kDlUWCmG3CKsCLSW2Yr/94/yos+sMc2bOnDkzv32/kpPzPL/nOed8f/PsfM5vfs9zzqaqkCT164BJFyBJGi+DXpI6Z9BLUucMeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktS51ZMuAODII4+sqampSZchSSvKzTff/I2qWjPXfssi6KemptixY8eky5CkFSXJg8Ps59SNJHXOoJekzhn0ktQ5g16SOmfQS1LnDHpJ6pxBL0mdM+glqXMGvSR1bll8MnYUU1s+O2P7AxefvcSVSNLy5Ihekjpn0EtS5wx6SercUEGf5IEkdyS5NcmO1nZ4kmuT3NvuD2vtSfK+JDuT3J7kpHF2QJL07OYzoj+1qk6sqvVtfQuwvaqOB7a3dYAzgePbbTPwgcUqVpI0f6NM3WwAtrXlbcA509ovr4EbgUOTHD3C60iSRjBs0BdwTZKbk2xubUdV1cNt+evAUW15LfDQtMfuam17SbI5yY4kO/bs2bOA0iVJwxj2Ovqfr6rdSX4QuDbJv07fWFWVpObzwlW1FdgKsH79+nk9VpI0vKFG9FW1u90/CnwaOBl45JkpmXb/aNt9N3DMtIeva22SpAmYM+iTPDfJ859ZBn4RuBO4CtjYdtsIXNmWrwLe2K6+OQV4YtoUjyRpiQ0zdXMU8Okkz+z/0ar6+yRfBq5Isgl4EDiv7X81cBawE3gKuHDRq5YkDW3OoK+q+4GXzND+TeD0GdoLuGhRqpMkjcxPxkpS5wx6SeqcQS9JnTPoJalzBr0kdc6gl6TOGfSS1DmDXpI6Z9BLUucMeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktQ5g16SOmfQS1LnDHpJ6pxBL0mdM+glqXMGvSR1zqCXpM4Z9JLUOYNekjpn0EtS5wx6SeqcQS9JnTPoJalzBr0kdc6gl6TOGfSS1Lmhgz7JqiS3JPlMWz8uyU1Jdib5RJKDWvvBbX1n2z41ntIlScOYz4j+rcA909YvAS6tqhcDjwGbWvsm4LHWfmnbT5I0IUMFfZJ1wNnAB9t6gNOAT7ZdtgHntOUNbZ22/fS2vyRpAoYd0b8H+D3gf9v6EcDjVfV0W98FrG3La4GHANr2J9r+e0myOcmOJDv27NmzwPIlSXOZM+iTvAZ4tKpuXswXrqqtVbW+qtavWbNmMZ9akjTN6iH2eTnw2iRnAYcALwDeCxyaZHUbta8Ddrf9dwPHALuSrAZeCHxz0SuXJA1lzhF9Vf1+Va2rqingfOC6qvoN4Hrg3LbbRuDKtnxVW6dtv66qalGrliQNbZTr6N8BvC3JTgZz8Je19suAI1r724Ato5UoSRrFMFM3/6+qbgBuaMv3AyfPsM+3gdctQm2SpEXgJ2MlqXMGvSR1zqCXpM4Z9JLUOYNekjpn0EtS5wx6SeqcQS9JnTPoJalzBr0kdc6gl6TOGfSS1DmDXpI6Z9BLUucMeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktQ5g16SOmfQS1LnDHpJ6pxBL0mdM+glqXOrJ13AuExt+eyM7Q9cfPYSVyJJk+WIXpI6Z9BLUucMeknqnEEvSZ2bM+iTHJLkS0luS3JXkj9u7ccluSnJziSfSHJQaz+4re9s26fG2wVJ0rMZZkT/38BpVfUS4ETgjCSnAJcAl1bVi4HHgE1t/03AY6390rafJGlC5gz6GniyrR7YbgWcBnyytW8DzmnLG9o6bfvpSbJoFUuS5mWoOfokq5LcCjwKXAvcBzxeVU+3XXYBa9vyWuAhgLb9CeCIxSxakjS8oYK+qr5bVScC64CTgR8d9YWTbE6yI8mOPXv2jPp0kqRZzOuqm6p6HLgeeBlwaJJnPlm7DtjdlncDxwC07S8EvjnDc22tqvVVtX7NmjULLF+SNJdhrrpZk+TQtvwDwKuBexgE/rltt43AlW35qrZO235dVdViFi1JGt4w33VzNLAtySoGbwxXVNVnktwNfDzJnwC3AJe1/S8DPpJkJ/At4Pwx1C1JGtKcQV9VtwMvnaH9fgbz9fu2fxt43aJUJ0kamZ+MlaTOGfSS1DmDXpI6Z9BLUucMeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktQ5g16SOmfQS1LnDHpJ6pxBL0mdM+glqXPD/McjXZna8tlZtz1w8dlLWIkkLQ1H9JLUOYNekjpn0EtS5wx6SeqcQS9Jndvvrrp5NrNdkePVOJJWMkf0ktQ5g16SOmfQS1LnDHpJ6pxBL0mdM+glqXMGvSR1zqCXpM4Z9JLUuTmDPskxSa5PcneSu5K8tbUfnuTaJPe2+8Nae5K8L8nOJLcnOWncnZAkzW6YEf3TwO9W1QnAKcBFSU4AtgDbq+p4YHtbBzgTOL7dNgMfWPSqJUlDmzPoq+rhqvpKW/5P4B5gLbAB2NZ22wac05Y3AJfXwI3AoUmOXvTKJUlDmdccfZIp4KXATcBRVfVw2/R14Ki2vBZ4aNrDdrU2SdIEDB30SZ4H/C3wO1X1H9O3VVUBNZ8XTrI5yY4kO/bs2TOfh0qS5mGorylOciCDkP/rqvpUa34kydFV9XCbmnm0te8Gjpn28HWtbS9VtRXYCrB+/fp5vUlIEvjV4sMa5qqbAJcB91TVu6dtugrY2JY3AldOa39ju/rmFOCJaVM8kqQlNsyI/uXAbwJ3JLm1tf0BcDFwRZJNwIPAeW3b1cBZwE7gKeDCRa1YkjQvcwZ9Vf0TkFk2nz7D/gVcNGJdkqRF4idjJalzBr0kdc6gl6TOGfSS1DmDXpI6N9QHpvZ3fihDmqzZfgc1HEf0ktQ5g16SOufUjaTuON26N0f0ktQ5g16SOufUjaT9xv46peOIXpI6Z9BLUucMeknqnHP0kpYNPwE7Ho7oJalzBr0kdc6gl6TOOUcvaWz21+vWlxuDXtJ+r/c3JINe0pLz6pqlZdBLGsqzhXMvI99eeTJWkjpn0EtS55y6GUHvJ3CkYTnnvrw5opekzjmil7QXR+ff08tf7Y7oJalzBr0kdc6gl6TOzTlHn+RDwGuAR6vqJ1rb4cAngCngAeC8qnosSYD3AmcBTwEXVNVXxlP68tXLvJ6kPgwzov8wcMY+bVuA7VV1PLC9rQOcCRzfbpuBDyxOmZKkhZpzRF9VX0gytU/zBuBVbXkbcAPwjtZ+eVUVcGOSQ5McXVUPL1bB0v5usf5i9Oqa/cdC5+iPmhbeXweOastrgYem7bertUmSJmTk6+irqpLUfB+XZDOD6R2OPfbYUcuQNAtH7lroiP6RJEcDtPtHW/tu4Jhp+61rbd+nqrZW1fqqWr9mzZoFliFJmstCR/RXARuBi9v9ldPa35Lk48DPAk84Pz83r9KRNE7DXF75MQYnXo9Msgv4IwYBf0WSTcCDwHlt96sZXFq5k8HllReOoWapK77Ra9yGuermDbNsOn2GfQu4aNSiJDm3rsXjl5pJC+RIXCuFX4EgSZ1zRL8fcOS5tPx5a7kx6LVfMYS1PzLol5An1xZuUgG9ko7ZSqpVS8ugl1iakDSINSkG/TI27lHsswWPUxlSP7zqRpI654heYzXf6Yql+GtF2t8Y9JqRV6dI/TDoJWmeFjIQmuTgyaDXsjLfKRenaKS5GfSaF6d0pJXHoF+BlmPYOrKWli+DviOGraSZeB29JHXOoJekzhn0ktQ5g16SOmfQS1LnvOpGkhbJcr3yzRG9JHXOoJekzhn0ktQ5g16SOmfQS1LnDHpJ6pxBL0mdM+glqXMGvSR1zqCXpM6NJeiTnJHkq0l2JtkyjteQJA1n0YM+ySrgL4AzgROANyQ5YbFfR5I0nHGM6E8GdlbV/VX1P8DHgQ1jeB1J0hDGEfRrgYemre9qbZKkCZjY1xQn2QxsbqtPJvnqAp/qSOAbi1PVxNmX5aeXfoB9WZZyyUh9edEwO40j6HcDx0xbX9fa9lJVW4Gto75Ykh1VtX7U51kO7Mvy00s/wL4sV0vRl3FM3XwZOD7JcUkOAs4HrhrD60iShrDoI/qqejrJW4DPA6uAD1XVXYv9OpKk4Yxljr6qrgauHsdzz2Dk6Z9lxL4sP730A+zLcjX2vqSqxv0akqQJ8isQJKlzyzro5/oqhSQvSrI9ye1Jbkiybp/tL0iyK8n7l67q7zdKP5J8N8mt7Tbxk9oj9uXYJNckuSfJ3UmmlrL2fS20L0lOnXZMbk3y7STnLH0P9qp1lOPyp0nuasflfUmytNXvVeco/bgkyZ3t9vqlrfz7JflQkkeT3DnL9rSf987Wn5OmbduY5N522zhyMVW1LG8MTuTeB/wQcBBwG3DCPvv8DbCxLZ8GfGSf7e8FPgq8f6X2A3hy0sdiEftyA/Dqtvw84DkrtS/T9jkc+NZK7Qvwc8A/t+dYBXwReNUK7MfZwLUMzjs+l8HVfy+Y1DFpNf0CcBJw5yzbzwI+BwQ4Bbhp2r+p+9v9YW35sFFqWc4j+mG+SuEE4Lq2fP307Ul+GjgKuGYJan02I/VjmVlwX9r3Ha2uqmsBqurJqnpqacqe0WIdl3OBz63gvhRwCINgPRg4EHhk7BXPbJR+nAB8oaqerqr/Am4HzliCmmdVVV9gMAiYzQbg8hq4ETg0ydHALwHXVtW3quoxBm9gI/VlOQf9MF+lcBvwq235V4DnJzkiyQHAnwNvH3uVc1twP9r6IUl2JLlx0tMDjNaXHwEeT/KpJLck+bP2BXiTMupxecb5wMfGUuHwFtyXqvoig8B8uN0+X1X3jLne2YxyTG4DzkjynCRHAqey9wc3l6PZ+rvoXyOznIN+GG8HXpnkFuCVDD6B+13gzcDVVbVrksXNw2z9AHhRDT419+vAe5L88IRqHNZsfVkNvKJt/xkGf55fMKEah/Vsx4U2+vpJBp8ZWe5m7EuSFwM/xuAT7GuB05K8YnJlzmnGflTVNQwu6f4XBm+8X2TasdrfTey7boYw51cpVNW/097dkzwP+LWqejzJy4BXJHkzg7ngg5I8WVWT+G78Bfejbdvd7u9PcgPwUgbzmJMwyjHZBdxaVfe3bX/HYF7ysqUofAYjHZfmPODTVfWdMdc6l1GOy28DN1bVk23b54CXAf+4FIXvY9TflXcB72rbPgp8bQlqHsVs/d0NvGqf9htGeqVJnqyY40TGagYnIY7jeydmfnyffY4EDmjL7wLeOcPzXMBkT8YuuB8MTsQcPG2fe9nn5NQK6suqtv+atv5XwEUrsS/Ttt8InDqpPizScXk98A/tOQ4EtgO/vAL7sQo4oi3/FHAng3NCkz42U8x+MvZs9j4Z+6XWfjjwb+33/7C2fPhIdUz6BzHHD+ksBu/K9wF/2NreCby2LZ/bwu9rwAefCcV9nuMCJhj0o/SDwRURd7R/8HcAm1byMQFezeAk2R3Ah4GDVnBfphiMvA6Y9DEZ8d/YKuAvgXuAu4F3r9B+HNLqv5vBG/CJy+CYfIzBeY/vMJhn3wS8CXhT2x4G/0nTfe13Yv20x/4WsLPdLhy1Fj8ZK0mdW+knYyVJczDoJalzBr0kdc6gl6TOGfSS1DmDXpI6Z9BLUucMeknq3P8BpdqoH5C0KWEAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.hist([_[2] for _ in data[::1] if _[2] > 0.94], bins=50);" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "Now it looks like a good threshold should be above `0.95`. Let's have a look at some of the candidate matches around there." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "def sample(data, threshold, num_samples, epsilon=0.01):\n", " samples = []\n", " for row in data:\n", " if abs(row[2] - threshold) <= epsilon:\n", " samples.append(row)\n", " if len(samples) >= num_samples:\n", " break\n", " return samples\n", "\n", "def lookup_originals(candidate_pair):\n", " a = dfA.iloc[candidate_pair[0]]\n", " b = dfB.iloc[candidate_pair[1]]\n", " return a, b" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "def look_at_per_field_accuracy(threshold = 0.999, num_samples = 100):\n", " results = []\n", " for i, candidate in enumerate(sample(data, threshold, num_samples, 0.01), start=1):\n", " record_a, record_b = lookup_originals(candidate)\n", " results.append(record_a == record_b)\n", "\n", " print(\"Proportion of exact matches for each field using threshold: {}\".format(threshold))\n", " print(sum(results)/num_samples)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "So we should expect a very high proportion of matches across all fields for high thresholds:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Proportion of exact matches for each field using threshold: 0.999\n", "given_name 0.93\n", "surname 0.96\n", "street_number 0.88\n", "address_1 0.92\n", "address_2 0.80\n", "suburb 0.92\n", "postcode 0.95\n", "state 1.00\n", "date_of_birth 0.96\n", "soc_sec_id 0.40\n", "dtype: float64\n" ] } ], "source": [ "look_at_per_field_accuracy(threshold = 0.999, num_samples = 100)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": {} }, "source": [ "But if we look at a threshold which is closer to the boundary between real matches we should see a lot more errors:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Proportion of exact matches for each field using threshold: 0.95\n", "given_name 0.49\n", "surname 0.57\n", "street_number 0.81\n", "address_1 0.55\n", "address_2 0.44\n", "suburb 0.70\n", "postcode 0.84\n", "state 0.93\n", "date_of_birth 0.84\n", "soc_sec_id 0.92\n", "dtype: float64\n" ] } ], "source": [ "look_at_per_field_accuracy(threshold = 0.95, num_samples = 100)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.12.0'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }